SASSIFI: Evaluating Resilience of GPU Applications

Authors

  • Siva Kumar Sastry Hari
  • Timothy Tsai
  • Mark Stephenson
  • Stephen W. Keckler
  • Joel Emer
Abstract

As GPUs become more pervasive in both scalable high-performance computing systems and safety-critical embedded systems, evaluating and analyzing their resilience will grow increasingly important. Because soft errors, such as those caused by high-energy particle strikes, form an important fraction of in-field hardware errors, GPU designers must develop tools and techniques to understand the effect of these errors on applications. This paper presents an error injection-based methodology to study the soft-error resilience of massively parallel applications running on state-of-the-art NVIDIA GPUs. Our approach uses a low-level assembly-language instrumentation tool called SASSI to profile applications and inject errors. SASSI is efficient because it allows instrumentation code to execute entirely on the GPU, and it can inject errors into condition code and predicate registers in addition to general-purpose registers and GPU memory. This paper describes our error injection tool and presents experiments that illustrate possible lines of analysis. We injected errors into Rodinia benchmark applications and report average detected and silent error probabilities for applications, static kernels, and dynamic kernel invocations. For applications with multiple invocations of the same static kernel, we also show how our tool can be used to study error propagation as a function of the injection time. Finally, we study the effect of errors on condition code and predicate registers.
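The injection-and-classification idea in the abstract can be illustrated with a small CPU-side sketch. It is not SASSIFI or SASSI code and uses none of their APIs. The workload `toy_workload`, the 32-bit register width, the helper names, and the three outcome labels (`masked`, `detected`, `silent`) are all assumptions chosen for illustration. The sketch flips every bit of every "register" once, reruns the workload, and compares the result against a fault-free (golden) run.

```python
from collections import Counter

WIDTH = 32  # assumed register width in bits (illustrative, not from the paper)


def flip_bit(value, bit, width=WIDTH):
    """Return `value` with a single bit flipped, truncated to `width` bits."""
    return (value ^ (1 << bit)) & ((1 << width) - 1)


def toy_workload(regs):
    """Hypothetical stand-in for a kernel.

    regs[0] indexes a 16-entry table (an out-of-range index raises IndexError,
    standing in for a detected error such as a crash); only the low byte of
    regs[1] reaches the output, so flips in its upper bits are masked.
    """
    table = list(range(16))
    return table[regs[0]] + (regs[1] & 0xFF)


def classify(golden, regs, reg, bit):
    """Inject one bit flip into regs[reg] and classify the outcome."""
    faulty = list(regs)
    faulty[reg] = flip_bit(faulty[reg], bit)
    try:
        out = toy_workload(faulty)
    except IndexError:
        return "detected"
    return "masked" if out == golden else "silent"


def campaign(regs):
    """Exhaustively inject into every bit of every register."""
    golden = toy_workload(regs)
    return Counter(
        classify(golden, regs, r, b)
        for r in range(len(regs))
        for b in range(WIDTH)
    )


counts = campaign([3, 5])
total = sum(counts.values())
rates = {outcome: n / total for outcome, n in counts.items()}
print(counts, rates)
```

A real campaign on a GPU would sample injection sites rather than enumerate them, because the space of dynamic instructions, registers, and bits is far too large to cover exhaustively.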

Similar resources

Error Resilience Evaluation on GPGPU Applications

While graphics processing units (GPUs) have gained wide adoption as accelerators for general-purpose applications (GPGPU), the end-to-end reliability implications of their use have not been quantified. Fault injection is a widely used method for evaluating the reliability of applications. However, building a fault injector for GPGPU applications is challenging due to their massive parallelism, ...

Evaluating the Error Resilience of GPGPU Applications

Over the past years, GPUs (Graphics Processing Units) have gained wide adoption as accelerators for general purpose computing. A number of studies [1, 2] have shown that significant performance gains can be achieved by deploying GPUs on traditional high performance computing (HPC) systems that host demanding scientific applications. However, the reliability implications of using GPUs are unclea...

Implementation of the direction of arrival estimation algorithms by means of GPU-parallel processing in the CUDA environment (Research Article)

Direction-of-arrival (DOA) estimation of audio signals is critical in different areas, including electronic warfare, sonar, etc. Beamforming methods such as Minimum Variance Distortionless Response (MVDR) and Delay-and-Sum (DAS), together with the subspace-based Multiple Signal Classification (MUSIC), are the best-known DOA estimation techniques. These methods have high computational complexity. Hence using...

Defining and Evaluating Resilience: A Performability Perspective

The notion of system “resilience” is receiving increased attention in domains ranging from safety-critical applications to ubiquitous computing. After reviewing how resilience has been defined in these contexts, we discuss roles that performability can play in both its definition and evaluation.

Risk assessment of resilience engineering level and integrated resilience engineering with safety, health and environment of hospitals (case study: two selected hospitals of Faraja in 1400)

Introduction: By evaluating the level of resilience engineering and resilience engineering combined with safety, health, and environment, the priorities of medical centers to face crises are determined. Therefore, this study aimed to implement a risk assessment model with a checklist designed in a fuzzy environment. Materials and Methods: This cross-sectional study, in 2014, in two Faraja hosp...

Publication date: 2015